Suffix Tree
   HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed
trie In computer science, a trie, also called digital tree or prefix tree, is a type of ''k''-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes def ...
containing all the
suffixes In linguistics, a suffix is an affix which is placed after the Stem (linguistics), stem of a word. Common examples are case endings, which indicate the grammatical case of nouns, adjectives, and verb endings, which form the Grammatical conjugation ...
of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations. The construction of such a tree for the string S takes time and space linear in the length of S. Once constructed, several operations can be performed quickly, for instance locating a
substring In formal language theory and computer science, a substring is a contiguous sequence of characters within a string. For instance, "''the best of''" is a substring of "''It was the best of times''". In contrast, "''Itwastimes''" is a subsequence ...
in S, locating a substring if a certain number of mistakes are allowed, locating matches for a
regular expression A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" or ...
pattern etc. Suffix trees also provide one of the first linear-time solutions for the
longest common substring problem In computer science, the longest common substring problem is to find a longest string that is a substring of two or more strings. The problem may have multiple solutions. Applications include data deduplication and plagiarism detection. Examples ...
. These speedups come at a cost: storing a string's suffix tree typically requires significantly more space than storing the string itself.


History

The concept was first introduced by . Rather than the suffix S ..n/math>, Weiner stored in his trie the ''prefix identifier'' for each position, that is, the shortest string starting at i and occurring only once in S. His ''Algorithm D'' takes an uncompressed trie for S +1..n/math> and extends it into a trie for S ..n/math>. This way, starting from the trivial trie for S ..n/math>, a trie for S ..n/math> can be built by n - 1 successive calls to Algorithm D; however, the overall run time is O(n^2). Weiner's ''Algorithm B'' maintains several auxiliary data structures, to achieve an over all run time linear in the size of the constructed trie. The latter can still be O(n^2) nodes, e.g. for S = a^n b^n a^n b^n \$ . Weiner's ''Algorithm C'' finally uses compressed tries to achieve linear overall storage size and run time.
Donald Knuth Donald Ervin Knuth ( ; born January 10, 1938) is an American computer scientist, mathematician, and professor emeritus at Stanford University. He is the 1974 recipient of the ACM Turing Award, informally considered the Nobel Prize of computer sc ...
subsequently characterized the latter as "Algorithm of the Year 1973". The text book reproduced Weiner's results in a simplified and more elegant form, introducing the term ''position tree''. was the first to build a (compressed) trie of all suffixes of S. Although the suffix starting at i is usually longer than the prefix identifier, their path representations in a compressed trie do not differ in size. On the other hand, McCreight could dispense with most of Weiner's auxiliary data structures; only suffix links remained. further simplified the construction. He provided the first online-construction of suffix trees, now known as
Ukkonen's algorithm In computer science, Ukkonen's algorithm is a linear-time, online algorithm for constructing suffix trees, proposed by Esko Ukkonen in 1995. The algorithm begins with an implicit suffix tree containing the first character of the string. Then it ste ...
, with running time that matched the then fastest algorithms. These algorithms are all linear-time for a constant-size alphabet, and have worst-case running time of O(n\log n) in general. gave the first suffix tree construction algorithm that is optimal for all alphabets. In particular, this is the first linear-time algorithm for strings drawn from an alphabet of integers in a polynomial range. Farach's algorithm has become the basis for new algorithms for constructing both suffix trees and
suffix array In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics. Suffix arrays were introduced by as ...
s, for example, in external memory, compressed, succinct, etc.


Definition

The suffix tree for the string S of length n is defined as a tree such that: * The tree has exactly n leaves numbered from 1 to n. * Except for the root, every
internal node In computer science, a tree is a widely used abstract data type that represents a hierarchical tree structure with a set of connected nodes. Each node in the tree can be connected to many children (depending on the type of tree), but must be con ...
has at least two children. * Each edge is labelled with a non-empty substring of S. * No two edges starting out of a node can have string-labels beginning with the same character. * The string obtained by concatenating all the string-labels found on the path from the root to leaf i spells out suffix S ..n/math>, for i from 1 to n. Since such a tree does not exist for all strings, S is padded with a terminal symbol not seen in the string (usually denoted $). This ensures that no suffix is a prefix of another, and that there will be n leaf nodes, one for each of the n suffixes of S. Since all internal non-root nodes are branching, there can be at most ''n'' −  1 such nodes, and ''n'' + (''n'' − 1) + 1 = 2''n'' nodes in total (''n'' leaves, ''n'' − 1 internal non-root nodes, 1 root). Suffix links are a key feature for older linear-time construction algorithms, although most newer algorithms, which are based on Farach's algorithm, dispense with suffix links. In a complete suffix tree, all internal non-root nodes have a suffix link to another internal node. If the path from the root to a node spells the string \chi\alpha, where \chi is a single character and \alpha is a string (possibly empty), it has a suffix link to the internal node representing \alpha. See for example the suffix link from the node for ANA to the node for NA in the figure above. Suffix links are also used in some algorithms running on the tree. A
generalized suffix tree In computer science, a generalized suffix tree is a suffix tree for a set of strings. Given the set of strings D=S_1,S_2,\dots,S_d of total length n, it is a Patricia tree containing all n suffixes of the strings. It is mostly used in bioinforma ...
is a suffix tree made for a set of strings instead of a single string. It represents all suffixes from this set of strings. Each string must be terminated by a different termination symbol.


Functionality

A suffix tree for a string S of length n can be built in \Theta(n) time, if the letters come from an alphabet of integers in a polynomial range (in particular, this is true for constant-sized alphabets). For larger alphabets, the running time is dominated by first
sorting Sorting refers to ordering data in an increasing or decreasing manner according to some linear relationship among the data items. # ordering: arranging items in a sequence ordered by some criterion; # categorizing: grouping items with similar pro ...
the letters to bring them into a range of size O(n); in general, this takes O(n\log n) time. The costs below are given under the assumption that the alphabet is constant. Assume that a suffix tree has been built for the string S of length n, or that a
generalised suffix tree In computer science, a generalized suffix tree is a suffix tree for a set of strings. Given the set of strings D=S_1,S_2,\dots,S_d of total length n, it is a Patricia tree containing all n suffixes of the strings. It is mostly used in bioinformat ...
has been built for the set of strings D=\ of total length n=n_1+n_2+\cdots+n_K. You can: * Search for strings: ** Check if a string P of length m is a substring in O(m) time. ** Find the first occurrence of the patterns P_1,\dots,P_q of total length m as substrings in O(m) time. ** Find all z occurrences of the patterns P_1,\dots,P_q of total length m as substrings in O(m + z) time. ** Search for a
regular expression A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" or ...
''P'' in time expected
sublinear In linear algebra, a sublinear function (or functional as is more often used in functional analysis), also called a quasi-seminorm or a Banach functional, on a vector space X is a real-valued function with only some of the properties of a seminorm. ...
in n. ** Find for each suffix of a pattern P, the length of the longest match between a prefix of P \dots m/math> and a substring in D in \Theta(m) time. This is termed the matching statistics for P. * Find properties of the strings: ** Find the longest common substrings of the string S_i and S_j in \Theta(n_i + n_j) time. ** Find all maximal pairs, maximal repeats or supermaximal repeats in \Theta(n + z) time. ** Find the
Lempel–Ziv LZ77 and LZ78 are the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977 and 1978. They are also known as LZ1 and LZ2 respectively. These two algorithms form the basis for many variations includin ...
decomposition in \Theta(n) time. ** Find the longest repeated substrings in \Theta(n) time. ** Find the most frequently occurring substrings of a minimum length in \Theta(n) time. ** Find the shortest strings from \Sigma that do not occur in D, in O(n + z) time, if there are z such strings. ** Find the shortest substrings occurring only once in \Theta(n) time. ** Find, for each i, the shortest substrings of S_i not occurring elsewhere in D in \Theta(n) time. The suffix tree can be prepared for constant time
lowest common ancestor In graph theory and computer science, the lowest common ancestor (LCA) (also called least common ancestor) of two nodes and in a Tree (graph theory), tree or directed acyclic graph (DAG) is the lowest (i.e. deepest) node that has both and a ...
retrieval between nodes in \Theta(n) time. One can then also: * Find the longest common prefix between the suffixes S_i ..n_i/math> and S_j ..n_j/math> in \Theta(1). * Search for a pattern ''P'' of length ''m'' with at most ''k'' mismatches in O(k n + z) time, where ''z'' is the number of hits. * Find all z maximal
palindrome A palindrome is a word, number, phrase, or other sequence of symbols that reads the same backwards as forwards, such as the words ''madam'' or ''racecar'', the date and time ''11/11/11 11:11,'' and the sentence: "A man, a plan, a canal – Panam ...
s in \Theta(n), or \Theta(g n) time if gaps of length g are allowed, or \Theta(k n) if k mismatches are allowed. * Find all z
tandem repeats Tandem repeats occur in DNA when a pattern of one or more nucleotides is repeated and the repetitions are directly adjacent to each other. Several protein domains also form tandem repeats within their amino acid primary structure, such as armadil ...
in O(n \log n + z), and ''k''-mismatch tandem repeats in O(k n \log (n/k) + z). * Find the longest common substrings to at least k strings in D for k=2,\dots,K in \Theta(n) time. * Find the
longest palindromic substring In computer science, the longest palindromic substring or longest symmetric factor problem is the problem of finding a maximum-length contiguous substring of a given string that is also a palindrome. For example, the longest palindromic substring o ...
of a given string (using the generalized suffix tree of the string and its reverse) in linear time.


Applications

Suffix trees can be used to solve a large number of string problems that occur in text-editing, free-text search,
computational biology Computational biology refers to the use of data analysis, mathematical modeling and computational simulations to understand biological systems and relationships. An intersection of computer science, biology, and big data, the field also has fo ...
and other application areas. Primary applications include: *
String search In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger stri ...
, in ''O(m)'' complexity, where ''m'' is the length of the sub-string (but with initial ''O(n)'' time required to build the suffix tree for the string) * Finding the longest repeated substring * Finding the longest common substring * Finding the longest
palindrome A palindrome is a word, number, phrase, or other sequence of symbols that reads the same backwards as forwards, such as the words ''madam'' or ''racecar'', the date and time ''11/11/11 11:11,'' and the sentence: "A man, a plan, a canal – Panam ...
in a string Suffix trees are often used in
bioinformatics Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combi ...
applications, searching for patterns in DNA or
protein Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, respo ...
sequences (which can be viewed as long strings of characters). The ability to search efficiently with mismatches might be considered their greatest strength. Suffix trees are also used in
data compression In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression ...
; they can be used to find repeated data, and can be used for the sorting stage of the
Burrows–Wheeler transform The Burrows–Wheeler transform (BWT, also called block-sorting compression) rearranges a character string into runs of similar characters. This is useful for compression, since it tends to be easy to compress a string that has runs of repeated c ...
. Variants of the LZW compression schemes use suffix trees ( LZSS). A suffix tree is also used in
suffix tree clustering Suffix Tree Clustering, often abbreviated as STC is an approach for clustering that uses suffix trees. A suffix tree cluster keeps track of all n-grams of any given length to be inserted into a set word string, while simultaneously allowing diffe ...
, a
data clustering Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of ...
algorithm used in some search engines.First introduced by .


Implementation

If each node and edge can be represented in \Theta(1) space, the entire tree can be represented in \Theta(n) space. The total length of all the strings on all of the edges in the tree is O(n^2), but each edge can be stored as the position and length of a substring of , giving a total space usage of \Theta(n) computer words. The worst-case space usage of a suffix tree is seen with a
fibonacci word A Fibonacci word is a specific sequence of binary digits (or symbols from any two-letter alphabet). The Fibonacci word is formed by repeated concatenation in the same way that the Fibonacci numbers are formed by repeated addition. It is a para ...
, giving the full 2n nodes. An important choice when making a suffix tree implementation is the parent-child relationships between nodes. The most common is using
linked list In computer science, a linked list is a linear collection of data elements whose order is not given by their physical placement in memory. Instead, each element points to the next. It is a data structure consisting of a collection of nodes whic ...
s called sibling lists. Each node has a pointer to its first child, and to the next node in the child list it is a part of. Other implementations with efficient running time properties use
hash map In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps Unique key, keys to Value (computer science), values. A hash table uses a hash ...
s, sorted or unsorted
array An array is a systematic arrangement of similar objects, usually in rows and columns. Things called an array include: {{TOC right Music * In twelve-tone and serial composition, the presentation of simultaneous twelve-tone sets such that the ...
s (with array doubling), or balanced search trees. We are interested in: * The cost of finding the child on a given character. * The cost of inserting a child. * The cost of enlisting all children of a node (divided by the number of children in the table below). Let be the size of the alphabet. Then you have the following costs: : \begin & \text & \text & \text \\ \hline \text & O(\sigma) & \Theta(1) & \Theta(1) \\ \text & O(\log \sigma) & \Theta(1) & \Theta(1) \\ \text & \Theta(1) & \Theta(1) & O(\sigma) \\ \text & O(\log \sigma) & O(\log \sigma) & O(1) \\ \text & O(\log \sigma) & O(\sigma) & O(1) \\ \text & O(1) & O(1) & O(1) \end The insertion cost is amortised, and that the costs for hashing are given for perfect hashing. The large amount of information in each edge and node makes the suffix tree very expensive, consuming about 10 to 20 times the memory size of the source text in good implementations. The
suffix array In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics. Suffix arrays were introduced by as ...
reduces this requirement to a factor of 8 (for array including LCP values built within 32-bit address space and 8-bit characters.) This factor depends on the properties and may reach 2 with usage of 4-byte wide characters (needed to contain any symbol in some
UNIX-like A Unix-like (sometimes referred to as UN*X or *nix) operating system is one that behaves in a manner similar to a Unix system, although not necessarily conforming to or being certified to any version of the Single UNIX Specification. A Unix-li ...
systems, see
wchar_t The C programming language has a set of functions implementing operations on strings (character strings and byte strings) in its standard library. Various operations, such as copying, concatenation, tokenization and searching are supported. ...
) on 32-bit systems. Researchers have continued to find smaller indexing structures.


Parallel construction

Various parallel algorithms to speed up suffix tree construction have been proposed. Recently, a practical parallel algorithm for suffix tree construction with O(n)
work Work may refer to: * Work (human activity), intentional activity people perform to support themselves, others, or the community ** Manual labour, physical work done by humans ** House work, housework, or homemaking ** Working animal, an animal t ...
(sequential time) and O(\log^2 n)
span Span may refer to: Science, technology and engineering * Span (unit), the width of a human hand * Span (engineering), a section between two intermediate supports * Wingspan, the distance between the wingtips of a bird or aircraft * Sorbitan ester ...
has been developed. The algorithm achieves good parallel scalability on shared-memory multicore machines and can index the
human genome The human genome is a complete set of nucleic acid sequences for humans, encoded as DNA within the 23 chromosome pairs in cell nuclei and in a small DNA molecule found within individual mitochondria. These are usually treated separately as the n ...
– approximately 3 GB – in under 3 minutes using a 40-core machine.


External construction

Though linear, the memory usage of a suffix tree is significantly higher than the actual size of the sequence collection. For a large text, construction may require external memory approaches. There are theoretical results for constructing suffix trees in external memory. The algorithm by is theoretically optimal, with an I/O complexity equal to that of sorting. However the overall intricacy of this algorithm has prevented, so far, its practical implementation. On the other hand, there have been practical works for constructing disk-based suffix trees which scale to (few) GB/hours. The state of the art methods are TDD,. TRELLIS,. DiGeST,. and B2ST.. TDD and TRELLIS scale up to the entire human genome resulting in a disk-based suffix tree of a size in the tens of gigabytes. However, these methods cannot handle efficiently collections of sequences exceeding 3GB. DiGeST performs significantly better and is able to handle collections of sequences in the order of 6GB in about 6 hours. . All these methods can efficiently build suffix trees for the case when the tree does not fit in main memory, but the input does. The most recent method, B2ST, scales to handle inputs that do not fit in main memory. ERA is a recent parallel suffix tree construction method that is significantly faster. ERA can index the entire human genome in 19 minutes on an 8-core desktop computer with 16GB RAM. On a simple Linux cluster with 16 nodes (4GB RAM per node), ERA can index the entire human genome in less than 9 minutes.


See also

*
Suffix automaton In computer science, a suffix automaton is an efficient data structure for representing the substring index of a given string which allows the storage, processing, and retrieval of compressed information about all its substrings. The suffix auto ...


Notes


References

*. *. *. *. *. *. *. *. *. *. *. *. *. *. *. * *. *. *. *. *. *.


External links


Suffix Trees
by
Sartaj Sahni Professor Sartaj Kumar Sahni (born July 22, 1949, in Pune, India) is a computer scientist based in the United States, and is one of the pioneers in the field of data structures. He is a distinguished professor in the Department of Computer and I ...

NIST's Dictionary of Algorithms and Data Structures: Suffix Tree

Universal Data Compression Based on the Burrows-Wheeler Transformation: Theory and Practice
application of suffix trees in the BWT
Theory and Practice of Succinct Data Structures
C++ implementation of a compressed suffix tree * Ukkonen's Suffix Tree Implementation in
Part 1Part 2Part 3Part 4Part 5Part 6

Online Demo: Ukkonen's Suffix Tree Visualization
{{DEFAULTSORT:Suffix Tree Trees (data structures) Substring indices String data structures Computer science suffixes